K-groups : tractable group detection on large link data sets
Abstract
Discovering underlying structure from co-occurrence data is an important task in many fields, including insurance, intelligence, criminal investigation, epidemiology, human resources, and marketing. For example, a store may wish to identify underlying sets of items purchased together, or a human resources department may wish to identify groups of employees who collaborate with each other. Previously, Kubica et al. presented the group detection algorithm (GDA), an algorithm for finding underlying groupings of entities from co-occurrence data. This algorithm is based on a probabilistic generative model and produces coherent groups that are consistent with prior knowledge. Unfortunately, the optimization used in GDA is slow, making it potentially infeasible for many real-world data sets. For example, in the co-publication domain, the MEDLINE database of medical publications alone contains over 2 million papers published within just a 5-year period, 1995-1999 [14]. To address this, we present k-groups, an algorithm that uses an approach similar to that of k-means (hard clustering and localized updates) to significantly accelerate the discovery of the underlying groups while retaining GDA's probabilistic model. In addition, we show that k-groups is guaranteed to converge to a local minimum. We also compare the performance of GDA and k-groups on several real-world and artificial data sets, showing that k-groups' sacrifice in solution quality is significantly offset by its increase in speed. This trade-off makes group detection tractable on significantly larger data sets.
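The abstract compares k-groups to k-means without giving details, so the following is only a minimal sketch of the k-means side of that analogy. It is plain Lloyd-style k-means on toy 1-D points, not the k-groups algorithm and not GDA's probabilistic model. The function name `kmeans`, its parameters, and the toy data are illustrative assumptions. It shows the two ingredients the abstract names: hard assignment (each point belongs to exactly one cluster) and localized updates (each center is recomputed from its own members only). It also records the cost after every iteration, which never increases.

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain k-means on 1-D points (illustrative only, not k-groups).

    Returns (centers, labels, history), where history holds the
    sum-of-squares cost after each iteration.
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial centers drawn from the data
    labels = None
    history = []
    for _ in range(iters):
        # Hard assignment: every point goes to its single nearest center.
        new_labels = [
            min(range(k), key=lambda j: (p - centers[j]) ** 2) for p in points
        ]
        # Localized update: each center depends only on its own members.
        for j in range(k):
            members = [p for p, l in zip(points, new_labels) if l == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = sum(members) / len(members)
        cost = sum((p - centers[l]) ** 2 for p, l in zip(points, new_labels))
        history.append(cost)
        converged = new_labels == labels
        labels = new_labels
        if converged:  # assignments stopped changing
            break
    return centers, labels, history

centers, labels, history = kmeans([0.0, 0.1, 0.2, 5.0, 5.1, 5.2], k=2)
```

Because neither step can raise the cost, the procedure stops at a local minimum of its objective. That is the kind of guarantee the abstract claims for k-groups, although k-groups optimizes a different, model-based objective.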
Related resources
Tractable Group Detection on Large Link Data Sets
Discovering underlying structure from co-occurrence data is an important task in a variety of fields, including insurance, intelligence, criminal investigation, epidemiology, human resources, and marketing. Previously, Kubica et al. presented the group detection algorithm (GDA), an algorithm for finding underlying groupings of entities from co-occurrence data. This algorithm is based on a proba...
A novel local search method for microaggregation
In this paper, we propose an effective microaggregation algorithm to produce more useful protected data for publishing. Microaggregation is mapped to a clustering problem with known minimum and maximum group size constraints. In this scheme, the goal is to cluster n records into groups of at least k and at most 2k-1 records, such that the sum of the within-group squ...
Group detection in complex networks: An algorithm and comparison of the state of the art
Complex real-world networks commonly reveal characteristic groups of nodes like communities and modules. These are of value in various applications, especially in the case of large social and information networks. However, while numerous community detection techniques have been presented in the literature, approaches for other groups of nodes are relatively rare and often limited in some way. W...
Anomaly Detection for Astronomical Data
Modern astronomical observatories can produce massive amounts of data, more than researchers can even glance at. These scientific observations present both great opportunities and challenges for astronomers and machine learning researchers. In this project we address the problem of detecting anomalies/novelties in these large-scale astronomical data sets. Two types ...
Group Anomaly Detection using Flexible Genre Models
An important task in exploring and analyzing real-world data sets is to detect unusual and interesting phenomena. In this paper, we study the group anomaly detection problem. Unlike traditional anomaly detection research that focuses on data points, our goal is to discover anomalous aggregated behaviors of groups of points. For this purpose, we propose the Flexible Genre Model (FGM). FGM is des...
Publication date: 2003